Analysis of Risk Factors on Heart Disease
To understand what drives heart disease, we examined several of its risk factors. Some of these risk factors, such as weight, are within an individual's control. Others, such as age or gender, are not. We hope that looking at these risk factors helps people better understand how to lower their risk of heart disease.
Here is a glimpse of the data:
tibble [70,000 × 12] (S3: tbl_df/tbl/data.frame)
$ age : num [1:70000] 18393 20228 18857 17623 17474 ...
$ gender : Factor w/ 2 levels "2","1": 1 2 2 1 2 2 2 1 2 2 ...
$ height : num [1:70000] 168 156 165 169 156 151 157 178 158 164 ...
$ weight : num [1:70000] 62 85 64 82 56 67 93 95 71 68 ...
$ ap_hi : num [1:70000] 110 140 130 150 100 120 130 130 110 110 ...
$ ap_lo : num [1:70000] 80 90 70 100 60 80 80 90 70 60 ...
$ cholesterol: Factor w/ 3 levels "1","3","2": 1 2 2 1 1 3 2 2 1 1 ...
$ gluc : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 1 3 1 1 ...
$ smoke : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ alco : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ active : Factor w/ 2 levels "1","0": 1 1 2 1 2 2 1 1 1 2 ...
$ cardio : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 1 1 ...
age: Age of the individual in days; we convert it to years in our analysis. (Integer)
gender: Gender of the individual. (String)
height: Height of the individual in centimeters. (Integer)
weight: Weight of the individual in kilograms. (Integer)
ap_hi: Systolic blood pressure reading. (Integer)
ap_lo: Diastolic blood pressure reading. (Integer)
cholesterol: Cholesterol level of the individual. (Integer)
gluc: Glucose level of the individual. (Integer)
smoke: Smoking status of the individual. (Boolean)
alco: Alcohol consumption status of the individual. (Boolean)
active: Physical activity level of the individual. (Boolean)
cardio: Presence or absence of cardiovascular disease. (Boolean)
BMI = (weight / height^2) * 703, with weight in pounds and height in inches
Obesity- BMI < 18.5 = “Underweight”, BMI >= 18.5 & BMI < 25 = “Healthy Weight”, BMI >= 25 & BMI < 30 = “Overweight”, BMI >= 30 = “Obese”
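As a quick check on the formula, the sketch below computes BMI for the first row of the data glimpse (168 cm, 62 kg) in two ways: with the metric definition and with the imperial formula used in this dashboard. The function names are ours and purely illustrative; the conversion factors 0.394 and 2.205 are the rounded values from the dashboard code, so the two results differ slightly.

```r
# Metric definition: kilograms divided by metres squared.
bmi_metric <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2
}

# Imperial formula as used in the dashboard, including its rounded
# unit conversions (cm -> in, kg -> lb).
bmi_imperial <- function(weight_kg, height_cm) {
  height_in <- 0.394 * height_cm
  weight_lb <- round(weight_kg * 2.205, 1)
  round(weight_lb / height_in^2 * 703, 2)
}

# Map BMI onto the four obesity categories listed above.
bmi_category <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Healthy Weight", "Overweight", "Obese"),
      right = FALSE)  # intervals are [lower, upper), matching the >= / < rules
}

metric   <- bmi_metric(62, 168)
imperial <- bmi_imperial(62, 168)
print(c(metric = metric, imperial = imperial))
print(bmi_category(imperial))
```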
Height -
We dropped anyone taller than 7ft (84in) and anyone shorter than 4ft (48in)
BMI -
We dropped anyone with a BMI over 105 and anyone with a BMI lower than 11 as these are impossible values.
Blood Pressure -
We dropped anyone with a Systolic blood pressure over 200. Anything over 180 is dangerous but possible. We dropped anyone with a diastolic blood pressure over 140. Anything over 120 is dangerous but possible. We dropped anyone with a systolic blood pressure under 50. Anything under 50 is considered dangerous. We dropped anyone with a diastolic blood pressure under 30. Anything under 30 is considered dangerous.
As the two histograms show, the BMI distributions for both groups are unimodal and skewed right. The key difference is that individuals with heart disease tend to have a higher BMI: their whole distribution is shifted slightly to the right. This may indicate a connection between higher BMI and heart disease.
First, the boxplot of age by obesity level shows that people with heart disease tend to have a higher median age than healthy people, regardless of obesity level. This suggests that the risk of heart disease increases with age whatever a person's weight.
The next boxplots show systolic blood pressure by obesity level for healthy individuals and those with heart disease. Higher obesity levels come with higher blood pressure, which suggests the two variables may be correlated. People with heart disease also have a higher median systolic blood pressure. The diastolic blood pressure boxplot shows a similar result, though with more variation in the data.
The next graphs are the correlation charts. They show that the two blood pressure readings are highly correlated with one another. Splitting the plots into healthy and unhealthy (heart disease) individuals also shows how the data is clustered: in each scatter plot most healthy individuals sit to the left, indicating that healthy individuals tend to be younger, with lower blood pressure readings and lower BMIs.
Cholesterol: Cholesterol seems to have a big impact once you get to the highest cholesterol group, where the proportion of heart disease jumps sharply. The proportion still rises with obesity level, but at every obesity level it is much higher in the highest cholesterol group.
Smoker: Smoking does not seem to have a significant impact on heart disease. If anything, smokers have a slightly lower proportion of heart disease in the figure. This sounds counterintuitive; however, because obesity is such a strong factor and smoking can cause weight loss, the negative heart effects of smoking may be offset by the weight loss.
Drinker: Drinking does not seem to have a significant impact on heart disease. Grouping by drinkers and nondrinkers does not change the proportions across obesity levels.
Glucose: Glucose does seem to heavily affect the proportion of heart disease at different obesity levels. Patients in the highest glucose category had a much higher proportion of heart disease, and this holds across all weight levels.
Gender: Gender seems to have no effect on the proportion of heart disease at each obesity level. The two graphs are almost identical.
Confusion Matrix and Statistics

               Reference
Prediction      Healthy Heart Disease
  Healthy          8140          3348
  Heart Disease    2259          6800

               Accuracy : 0.7271
                 95% CI : (0.721, 0.7332)
    No Information Rate : 0.5061
    P-Value [Acc > NIR] : < 0.00000000000000022
                  Kappa : 0.4534
 Mcnemar's Test P-Value : < 0.00000000000000022
            Sensitivity : 0.7828
            Specificity : 0.6701
         Pos Pred Value : 0.7086
         Neg Pred Value : 0.7506
             Prevalence : 0.5061
         Detection Rate : 0.3962
   Detection Prevalence : 0.5591
      Balanced Accuracy : 0.7264
       'Positive' Class : Healthy
The logistic regression model predicts whether or not an individual has heart disease. Because 'Healthy' is the positive class, sensitivity and the positive predictive value refer to healthy individuals. The accuracy of the model is 0.7271, which means that 72.71% of the model's predictions were correct. The sensitivity is 0.7828, which means that the model correctly identified 78.28% of the people with healthy hearts. The specificity is 0.6701, which means that it correctly identified 67.01% of the people with heart disease. The positive predictive value (PPV) is 0.7086, which means that when the model predicts a healthy heart, there is a 70.86% chance that the person is actually healthy. The negative predictive value (NPV) is 0.7506, which means that when the model predicts heart disease, there is a 75.06% chance that the person actually has it.
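The statistics reported by caret can be reproduced by hand from the four counts in the confusion matrix above. The sketch below does so in base R, treating Healthy as the positive class as the output states; the variable names are ours and purely illustrative.

```r
# Counts from the confusion matrix (rows = prediction, columns = reference).
tp <- 8140  # predicted Healthy,       actually Healthy
fp <- 3348  # predicted Healthy,       actually Heart Disease
fn <- 2259  # predicted Heart Disease, actually Healthy
tn <- 6800  # predicted Heart Disease, actually Heart Disease

metrics <- c(
  accuracy    = (tp + tn) / (tp + fp + fn + tn),
  sensitivity = tp / (tp + fn),  # share of healthy people identified as healthy
  specificity = tn / (tn + fp),  # share of heart disease cases identified as such
  ppv         = tp / (tp + fp),  # predicted Healthy who really are healthy
  npv         = tn / (tn + fn)   # predicted Heart Disease who really have it
)
print(round(metrics, 4))
```

Each value matches the corresponding line of the caret output to four decimal places.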
In conclusion, a combination of factors seems to contribute to heart disease. Some of these factors are within your control, such as drinking, BMI, glucose, blood pressure, and cholesterol. If someone wants to avoid heart disease, it is important to reduce the risk factors they can control as they age. Age is an uncontrollable risk factor that greatly increases the chance of heart disease over time. To counter this, one must try to lower the other risk factors.
If we were to revise and extend our original research question, we would focus less on alcohol consumption and smoking, because we found that they did not have as large an impact as we expected. Going forward, it would be interesting to explore cholesterol and glucose levels in relation to BMI to see whether people with high cholesterol or glucose levels are more likely to be obese. We found that BMI had a very large impact on heart disease, so this would further the study.
We used data from Kaggle; the data set was created by Kuzak Dempsy.
https://en.wikipedia.org/wiki/Jon_Brower_Minnoch#:~:text=He%20died%2023%20months%20later,Body%20Mass%20Index%20of%20105.3.
For additional research on how to decide our limits for impossible values:
https://www.ennonline.net/fex/15/limits#
https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings
https://myallamericanhospice.com/dangerous-low-blood-pressure/
---
title: "Heart Disease Dashboard"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: materia
      primary: "#0072B2"
    orientation: columns
    vertical_layout: fill
    source_code: embed
runtime: shiny
---
<style type="text/css">
.chart-title{ /* chart_title */
font-size: 20px;
}
body{
/* Normal*/
font-size: 16px;
}
</style>
```{r setup, include=FALSE}
library(shiny)
library(flexdashboard)
```
Basic Information
===
Column {data-width=600}
---
***Analysis of Risk Factors on Heart Disease***
### Introduction
To understand what drives heart disease, we examined several of its risk factors. Some of these risk factors, such as weight, are within an individual's control. Others, such as age or gender, are not. We hope that looking at these risk factors helps people better understand how to lower their risk of heart disease.
Here is a glimpse of the data:
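```{r load_data}
library(readr)
library(dplyr)
library(ggplot2)

# NOTE: hard-coded working directory; adjust to wherever heart_data.csv lives.
setwd("C:/MTH 208")

# Read the raw data: "n" columns are numeric, "f" columns are factors.
HeartData <- read_csv("heart_data.csv", col_types = "nnnfnnnnfffffff")
options(scipen = 999)  # print numbers without scientific notation

# Drop the two identifier columns.
HeartData <- select(HeartData, -c(index, id))

# Return the rows of df whose value in `variable` (a vector taken from df)
# lies outside the 1.5 * IQR fences.
outliers <- function(df, variable) {
  fsum <- fivenum(variable)
  IQR <- fsum[4] - fsum[2]
  lower.f <- fsum[2] - 1.5 * IQR
  upper.f <- fsum[4] + 1.5 * IQR
  df %>% filter(variable < lower.f | variable > upper.f)
}

str(HeartData)
```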
Column {data-width=400}
---
### Variable Introduction
age: Age of the individual in days; we convert it to years in our analysis. (Integer)
gender: Gender of the individual. (String)
height: Height of the individual in centimeters. (Integer)
weight: Weight of the individual in kilograms. (Integer)
ap_hi: Systolic blood pressure reading. (Integer)
ap_lo: Diastolic blood pressure reading. (Integer)
cholesterol: Cholesterol level of the individual. (Integer)
gluc: Glucose level of the individual. (Integer)
smoke: Smoking status of the individual. (Boolean)
alco: Alcohol consumption status of the individual. (Boolean)
active: Physical activity level of the individual. (Boolean)
cardio: Presence or absence of cardiovascular disease. (Boolean)
Outliers and Issues
===
Column {.tabset data-width=800}
---
### Kept Values and New Variables
```{r KeptValues}
df<-HeartData %>% mutate(
height=.394*height,
age=round(age/365),
weight=round(weight*2.205,1),
BMI=round((weight/(height*height))*703,2),
obesity = case_when(
BMI < 18.5 ~ "Underweight",
BMI >= 18.5 & BMI < 25 ~ "Healthy Weight",
BMI >= 25 & BMI < 30 ~ "Overweight",
BMI >= 30 ~ "Obese",)
)
df$gender<-df$gender %>% recode(
'1'="Female",
'2'="Male"
)
df$alco<-df$alco %>% recode(
'0'="Nondrinker",
'1'="Drinker"
)
df$smoke<-df$smoke %>% recode(
'0'="Nonsmoker",
'1'="Smoker"
)
df$cardio<-df$cardio %>% recode(
'0'="Healthy",
'1'="Heart Disease"
)
df$cholesterol<-df$cholesterol %>% recode(
'1'="Normal",
'2'="Above Normal",
'3'="Well Above Normal"
)
df$gluc<-df$gluc %>% recode(
'1'="Normal",
'2'="Above Normal",
'3'="Well Above Normal"
)
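# Keep only rows whose measurements fall strictly inside the plausibility limits.
HeartClean <- df %>% filter(height < 84 & BMI < 105 & height > 48 & BMI > 11 & ap_hi < 200 & ap_hi > 50 & ap_lo < 140 & ap_lo > 30)
# Everything else is omitted; >= and <= make this the exact complement of HeartClean,
# so rows sitting exactly on a limit are no longer missing from both tables.
RemovedValues <- df %>% filter(height >= 84 | BMI >= 105 | height <= 48 | BMI <= 11 | ap_hi >= 200 | ap_hi <= 50 | ap_lo >= 140 | ap_lo <= 30)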
DT::datatable(HeartClean, colnames = c("Age","Gender","Height (in)","Weight (lbs)","Systolic blood pressure (mm Hg)", "Diastolic blood pressure (mm Hg)","Cholesterol","Glucose","Smoker","Alcohol","Active","Heart Disease","BMI","Obesity"))
```
### Omitted Values
```{r Omitted}
DT::datatable(RemovedValues, colnames = c("Age","Gender","Height (in)","Weight (lbs)","Systolic blood pressure (mm Hg)", "Diastolic blood pressure (mm Hg)","Cholesterol","Glucose","Smoker","Alcohol","Active","Heart Disease","BMI","Obesity"))
```
Column {data-width=200}
---
### Explanation
BMI = (weight / height^2) * 703, with weight in pounds and height in inches
Obesity-
BMI < 18.5 = "Underweight",
BMI >= 18.5 & BMI < 25 = "Healthy Weight",
BMI >= 25 & BMI < 30 = "Overweight",
BMI >= 30 = "Obese"
Height -
We dropped anyone taller than 7ft (84in) and anyone shorter than 4ft (48in)
BMI -
We dropped anyone with a BMI over 105 and anyone with a BMI lower than 11 as these are impossible values.
Blood Pressure -
We dropped anyone with a Systolic blood pressure over 200. Anything over 180 is dangerous but possible.
We dropped anyone with a diastolic blood pressure over 140. Anything over 120 is dangerous but possible.
We dropped anyone with a systolic blood pressure under 50. Anything under 50 is considered dangerous.
We dropped anyone with a diastolic blood pressure under 30. Anything under 30 is considered dangerous.
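
The cutoffs above can be sketched as a single plausibility check. This is a minimal base R illustration on four made-up rows, not the dashboard's own code; the helper name `is_plausible` and the example values are hypothetical.

```r
# Hypothetical example rows: height in inches, BMI, blood pressure in mm Hg.
people <- data.frame(
  height = c(66.2, 90.0, 61.5, 65.0),
  BMI    = c(21.9, 24.0, 35.1, 27.3),
  ap_hi  = c(110, 120, 140, 16020),  # 16020 mimics a data-entry error
  ap_lo  = c(80, 80, 90, 80)
)

# TRUE when every measurement falls strictly inside the stated limits.
is_plausible <- function(d) {
  d$height > 48 & d$height < 84 &
    d$BMI > 11 & d$BMI < 105 &
    d$ap_hi > 50 & d$ap_hi < 200 &
    d$ap_lo > 30 & d$ap_lo < 140
}

keep    <- is_plausible(people)
kept    <- people[keep, ]
omitted <- people[!keep, ]  # exact complement: no row is lost or counted twice

print(kept)
print(omitted)
```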
EDA-1
===
Column{.tabset data-width=750}
---
### Heart Disease BMI
```{r histogramsofBMIbasedonHealthy}
heartdisease<-filter(HeartClean,cardio=="Heart Disease")
noheartdisease<-filter(HeartClean,cardio=="Healthy")
hist(heartdisease$BMI, col = "#E69F00", breaks=20, main ="Histogram of BMI for Heart Disease", xlab="BMI" )
```
### Healthy BMI
```{r histogramsofBMIbasedonHeartDisease}
hist(noheartdisease$BMI, col = "#E69F00",breaks=10, main ="Histogram of BMI for Healthy", xlab="BMI" )
```
Column {data-width=250}
---
### Explanation
As the two histograms show, the BMI distributions for both groups are unimodal and skewed right. The key difference is that individuals with heart disease tend to have a higher BMI: their whole distribution is shifted slightly to the right. This may indicate a connection between higher BMI and heart disease.
EDA-2
===
Column{.tabset data-width=750}
---
### Age
```{r boxplotAge}
HeartClean$cardio <- as.factor(HeartClean$cardio)
HeartClean$obesity <- factor(HeartClean$obesity, levels = c("Underweight", "Healthy Weight", "Overweight", "Obese"))
HeartClean$cholesterol <- factor(HeartClean$cholesterol, levels = c("Normal", "Above Normal","Well Above Normal"))
HeartClean$gluc <- factor(HeartClean$gluc, levels = c("Normal", "Above Normal","Well Above Normal"))
library(ggplot2)
library(viridis)
ggplot(HeartClean, aes(x = obesity, y = age, fill = cardio)) +
geom_boxplot() +
scale_fill_manual(values=c("#E69F00", "#0072B2")) +
labs(x = "Obesity", y = "Age", title = "Boxplot of Age by Obesity and Cardiovascular Disease Status",fill="Cardiovascular Disease")
```
### Systolic
```{r boxplotSystolicBloodPressure}
ggplot(HeartClean, aes(x = obesity, y = ap_hi, fill = cardio)) +
geom_boxplot() +
scale_fill_manual(values=c("#E69F00", "#0072B2")) +
labs(x = "Obesity", y = "Systolic Blood Pressure", title = "Boxplot of Systolic BP by Obesity and Cardiovascular Disease Status",fill="Cardiovascular Disease")
```
### Diastolic
```{r DistolicBloodPressure}
ggplot(HeartClean, aes(x = obesity, y = ap_lo, fill = cardio)) +
geom_boxplot() +
scale_fill_manual(values=c("#E69F00", "#0072B2")) +
labs(x = "Obesity", y = "Diastolic Blood Pressure", title = "Boxplot of Diastolic BP by Obesity and Cardiovascular Disease Status",fill="Cardiovascular Disease")
```
### Correlations
```{r correlations}
library(GGally)
HeartCleantest<-select(HeartClean, c(ap_hi,ap_lo,BMI,age,cardio))
HeartCleantest$cardio<-HeartCleantest$cardio %>% recode(
'Heart Disease'="Unhealthy"
)
HeartCleantest <- rename(HeartCleantest, Systolic = ap_hi)
HeartCleantest <- rename(HeartCleantest, Diastolic = ap_lo)
ggpairs(HeartCleantest, columns = 1:4, ggplot2::aes(colour=cardio))+
scale_color_manual(values = c("#E69F00", "#0072B2"))+ scale_fill_manual(values=c("#E69F00", "#0072B2"))+
# Add a title to the plot
ggtitle("Scatterplot Matrix of Heart Disease")
```
Column {data-width=250}
---
### Explanation
First, the boxplot of age by obesity level shows that people with heart disease tend to have a higher median age than healthy people, regardless of obesity level. This suggests that the risk of heart disease increases with age whatever a person's weight.
The next boxplots show systolic blood pressure by obesity level for healthy individuals and those with heart disease. Higher obesity levels come with higher blood pressure, which suggests the two variables may be correlated. People with heart disease also have a higher median systolic blood pressure. The diastolic blood pressure boxplot shows a similar result, though with more variation in the data.
The next graphs are the correlation charts. They show that the two blood pressure readings are highly correlated with one another. Splitting the plots into healthy and unhealthy (heart disease) individuals also shows how the data is clustered: in each scatter plot most healthy individuals sit to the left, indicating that healthy individuals tend to be younger, with lower blood pressure readings and lower BMIs.
EDA-3
===
Column{.tabset data-width=750}
---
### Cholesterol
```{r Cholesterol}
ggplot(HeartClean, aes(fill=cardio, x=obesity)) +
geom_bar(position="fill") +
labs(x = "Obesity", y = "Percentage", fill = "Heart Disease",title = "Heart Disease by Obesity and Cholesterol") +
facet_wrap(~cholesterol, ncol=2) +
theme(axis.text.x = element_text(size = 6, angle=45)) +
scale_fill_manual(values=c("#E69F00", "#0072B2"), name="Heart Disease", labels=c("No", "Yes"))
```
### Smoke
```{r smoke}
ggplot(HeartClean, aes(fill=cardio, x=obesity)) +
geom_bar(position="fill") +
labs(x = "Obesity", y = "Percentage", fill = "Heart Disease",title = "Heart Disease by Obesity and Smoker") +
facet_wrap(~smoke, ncol=2) +
theme(axis.text.x = element_text(size = 6, angle=45)) +
scale_fill_manual(values=c("#E69F00", "#0072B2"), name="Heart Disease", labels=c("No", "Yes"))
```
### Alcohol
```{r Alcohol}
ggplot(HeartClean, aes(fill=cardio, x=obesity)) +
geom_bar(position="fill") +
labs(x = "Obesity", y = "Percentage", fill = "Heart Disease",title = "Heart Disease by Obesity and Alcohol") +
facet_wrap(~alco, ncol=2) +
theme(axis.text.x = element_text(size = 6, angle=45)) +
scale_fill_manual(values=c("#E69F00", "#0072B2"), name="Heart Disease", labels=c("No", "Yes"))
```
### Glucose
```{r Glucose}
ggplot(HeartClean, aes(fill=cardio, x=obesity)) +
geom_bar(position="fill") +
labs(x = "Obesity", y = "Percentage", fill = "Heart Disease",title = "Heart Disease by Obesity and Glucose") +
facet_wrap(~gluc, ncol=2) +
theme(axis.text.x = element_text(size = 6, angle=45)) +
scale_fill_manual(values=c("#E69F00", "#0072B2"), name="Heart Disease", labels=c("No", "Yes"))
```
### Gender
```{r Gender}
ggplot(HeartClean, aes(fill=cardio, x=obesity)) +
geom_bar(position="fill") +
labs(x = "Obesity", y = "Percentage", fill = "Heart Disease", title = "Heart Disease by Obesity and Gender") +
facet_wrap(~gender, ncol=2) +
theme(axis.text.x = element_text(size = 6, angle=45)) +
scale_fill_manual(values=c("#E69F00", "#0072B2"), name="Heart Disease", labels=c("No", "Yes"))
```
Column {data-width=250}
---
### Explanation
Cholesterol:
Cholesterol seems to have a big impact once you get to the highest cholesterol group, where the proportion of heart disease jumps sharply. The proportion still rises with obesity level, but at every obesity level it is much higher in the highest cholesterol group.
Smoker:
Smoking does not seem to have a significant impact on heart disease. If anything, smokers have a slightly lower proportion of heart disease in the figure. This sounds counterintuitive; however, because obesity is such a strong factor and smoking can cause weight loss, the negative heart effects of smoking may be offset by the weight loss.
Drinker:
Drinking does not seem to have a significant impact on heart disease. Grouping by drinkers and nondrinkers does not change the proportions across obesity levels.
Glucose:
Glucose does seem to heavily affect the proportion of heart disease at different obesity levels. Patients in the highest glucose category had a much higher proportion of heart disease, and this holds across all weight levels.
Gender:
Gender seems to have no effect on the proportion of heart disease at each obesity level. The two graphs are almost identical.
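
The proportions drawn in these bar charts can also be read off as numbers. The sketch below is a minimal base R illustration on a small made-up data frame (the rows are hypothetical, not taken from the data set); it computes the share of heart disease within each obesity level and smoking group.

```r
# Hypothetical toy data using the same column names as the dashboard.
toy <- data.frame(
  obesity = c("Obese", "Obese", "Obese", "Obese",
              "Healthy Weight", "Healthy Weight", "Healthy Weight", "Healthy Weight"),
  smoke   = c("Smoker", "Smoker", "Nonsmoker", "Nonsmoker",
              "Smoker", "Smoker", "Nonsmoker", "Nonsmoker"),
  cardio  = c("Heart Disease", "Healthy", "Heart Disease", "Heart Disease",
              "Healthy", "Healthy", "Heart Disease", "Healthy")
)

# Count every obesity x smoke x cardio combination.
counts <- table(toy$obesity, toy$smoke, toy$cardio,
                dnn = c("obesity", "smoke", "cardio"))

# Divide by the obesity x smoke totals so each group sums to 1,
# which is the quantity geom_bar(position = "fill") draws.
props <- prop.table(counts, margin = c(1, 2))

# Share of heart disease in each obesity x smoke group.
print(props[, , "Heart Disease"])
```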
Machine Learning Model
===
Column {data-width=400}
---
### Efficiency
```{r regress}
library(caret)

# Drop height, weight, active and obesity (columns 3, 4, 11 and 14) before modelling.
HeartCleanRegress <- HeartClean %>% select(-c(height, weight, active, obesity))

# 70/30 train/test split, stratified on the outcome.
trainIndex <- createDataPartition(HeartClean$cardio, p = .7, list = FALSE)
trainData <- HeartCleanRegress[trainIndex, ]
testData <- HeartCleanRegress[-trainIndex, ]

# Fit a logistic regression on the training set and evaluate it on the test set.
logistic_model <- train(cardio ~ ., data = trainData, method = "glm", family = "binomial")
predictions <- predict(logistic_model, newdata = testData)
confusionMatrix(predictions, testData$cardio)

# Refit on the full cleaned data and save the model for later use.
logit_model <- glm(cardio ~ age + BMI + gluc + ap_hi + ap_lo + cholesterol + gender + smoke + alco, data = HeartClean, family = "binomial")
saveRDS(logit_model, file = "my_model.rds")
```
Column {data-width=600}
---
### Result
The logistic regression model predicts whether or not an individual has heart disease. Because caret treats "Healthy" as the positive class, sensitivity and the positive predictive value refer to healthy individuals. In the run we report, the accuracy of the model is 0.7271, which means that 72.71% of the model's predictions were correct. The sensitivity is 0.7828, which means that the model correctly identified 78.28% of the people with healthy hearts. The specificity is 0.6701, which means that it correctly identified 67.01% of the people with heart disease. The positive predictive value (PPV) is 0.7086, which means that when the model predicts a healthy heart, there is a 70.86% chance that the person is actually healthy. The negative predictive value (NPV) is 0.7506, which means that when the model predicts heart disease, there is a 75.06% chance that the person actually has it. Because the train/test split is random, these figures change slightly each time the dashboard is knit.
Conclusion
===
Column {data-width=600}
---
### Conclusion
In conclusion, a combination of factors seems to contribute to heart disease. Some of these factors are within your control, such as drinking, BMI, glucose, blood pressure, and cholesterol. If someone wants to avoid heart disease, it is important to reduce the risk factors they can control as they age. Age is an uncontrollable risk factor that greatly increases the chance of heart disease over time. To counter this, one must try to lower the other risk factors.
If we were to revise and extend our original research question, we would focus less on alcohol consumption and smoking, because we found that they did not have as large an impact as we expected. Going forward, it would be interesting to explore cholesterol and glucose levels in relation to BMI to see whether people with high cholesterol or glucose levels are more likely to be obese. We found that BMI had a very large impact on heart disease, so this would further the study.
### References
We used data from [Kaggle](https://www.kaggle.com/datasets/thedevastator/exploring-risk-factors-for-cardiovascular-diseas); the data set was created by Kuzak Dempsy.
https://en.wikipedia.org/wiki/Jon_Brower_Minnoch#:~:text=He%20died%2023%20months%20later,Body%20Mass%20Index%20of%20105.3.
For additional research on how to decide our limits for impossible values:
https://www.ennonline.net/fex/15/limits#
https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings
https://myallamericanhospice.com/dangerous-low-blood-pressure/
Column {data-width=400}
---
### About the Authors
Aidan Bramer
I am a Senior pursuing a B.S. in Applied Mathematics in Economics at the University of Dayton.
Connect with me on [LinkedIn](https://www.linkedin.com/in/aidan-bramer-0652b8246?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3Bch9GIwx8SJOQzkvK8KXBSA%3D%3D).
Evan Dolley
I am a Senior pursuing a degree in Mechanical Engineering.
Connect with me on [LinkedIn](https://www.linkedin.com/in/Evan-dolley-46624b180/).